Multi-tier message correlation

ABSTRACT

A system and method determines correlations within multi-tier communications based on repeated iterations/episodes of executions of a target application. Content-based correlations are determined by encoding the content using a finite alphabet, then searching for similar sequences among the multiple traces. By encoding the content to a finite alphabet, common pattern matching techniques may be used, including, for example, DNA alignment algorithms. To facilitate alignment of the traces, structural and/or semantic breakpoints are defined, and the encoding in each trace is synchronized to these breakpoints. To facilitate efficient processing, a hierarchy of causality among tier-pairs is identified, and messages at lower levels are ranked and temporally filtered, based on activity intervals at higher levels of the hierarchy.

This application is a continuation of, and claims priority to, U.S. patent application Ser. No. 13/117,105, filed 26 May 2011 (to be issued as U.S. Pat. No. 8,756,312 on 17 Jun. 2014). U.S. patent application Ser. No. 13/117,105 is a non-provisional of, and claims priority to, U.S. Provisional Patent Application 61/348,875, filed 27 May 2010.

BACKGROUND AND SUMMARY OF THE INVENTION

This invention relates to the field of application performance analysis, and in particular to a method and system for identifying message streams corresponding to a transaction that includes communications between multiple tiers.

The ever-increasing use of applications that operate on a network has increased the need for application performance analysis systems that can assess the efficiency of transactions that utilize the network.

In a typical network-based application, a user executes the application at a client device, and in the process of executing the application, messages are communicated between the client and one or more servers. These messages are generally interspersed among messages from other applications being executed at the same time by the user, or by other users. To determine the performance of transactions of a particular application, the messages corresponding to the communications related to each transaction are distinguished from the other messages, so that performance data, such as delay times, can be collected.

A number of techniques are commonly used to distinguish messages related to transactions of an application, including, for example, distinguishing the source and destination addresses associated with the client and server(s) of each transaction. Such techniques, however, are unable to identify ‘secondary’ or ‘consequential’ communications associated with such transactions. That is, for example, a message from the client to a server may cause the server to contact another server, such as a database server. The resultant communications between the servers will not generally include a reference to the client, and techniques that rely upon distinguishing messages to or from the client will not be able to associate these communications with the transaction.

For ease of understanding and reference, the terms ‘tier’ and ‘tier-pair’ are used to identify the relationship among communicating elements. In the above example, the client is at a first tier (e.g. a user tier); the servers that the client communicates directly with are at a second tier (e.g. a web server tier); the servers that the servers at the second tier communicate directly with are at a third tier (e.g. a database server tier); and so on. A pair of elements that communicate directly is termed a ‘tier-pair’. Note that the terms ‘client’, ‘server’, ‘database’, etc. are used herein to facilitate understanding; the particular elements at any given tier may comprise any type of device with communication capability.

U.S. Pat. No. 7,729,256, “CORRELATING PACKETS”, issued 1 Jun. 2010 to Patrick J. Malloy, Michael Cohen, and Alain J. Cohen, discloses a method for determining (or approximating) which messages correspond to a particular transaction from among other messages in a set of multi-tier communication traces. The particular transaction is characterized as comprising a sequence of ‘reference’ packets, which is a sequence of packets among tier-pairs that typically occur during execution of the application, such as illustrated in FIG. 1A. For example, the reference sequence indicated by arrow 1 may correspond to a typical client's (Client A) request to a server (Web-Server B) for data, the server's request (arrow 3) to a database server (DB Server D), the database server's communication of the data (arrows 4) to the requesting server, and the requesting server's communication of this data (arrow 6) to the requesting client. The other arrows in the reference sequence of FIG. 1A include, for example, communication of other requests, data, acknowledgements, and so on. These reference sequences may be based on a simulation of the application, or the operation of the application in a controlled, or isolated environment.

FIG. 1B illustrates the sequence of communications 1, 2, 3 . . . 9 corresponding to a transaction that occurs during the execution of the application on an actual network. As illustrated, the sequence is masked by other communications occurring between the tier-pairs A-B and B-D. As disclosed in U.S. Pat. No. 7,729,256, sets of traces of communications between tiers in the actual network are analyzed to find a sequence in the traces that appears to be similar to the reference sequence, based on a measure of correlation between possible sequences in the traces and the reference sequence. The correlation may be based on factors such as information in the header of the packets, the size of the packets, keywords or phrases in the packets, and so on.

The use of a reference sequence to find a matching sequence of packets in a production environment, however, requires the creation and/or identification of a sequence that is representative of a transaction or set of transactions that are likely to occur during the execution of the application of interest, as illustrated in FIG. 1A. In some applications, particularly ‘static’ applications, this may be a fairly straightforward task. In ‘dynamic’ applications, such as highly interactive applications, the transactions may differ based on the particular user, or the particular tasks performed within the application. In such a dynamic environment, different reference sequences may need to be defined, each reference sequence being specific to a particular user, or a particular task.

Also, because the specific content of a sequence of packets can be expected to differ among different users of an application, the use of correlation factors based on content is fairly limited when using pre-defined reference sequences.

It would be advantageous to be able to identify sequences associated with transactions of an application in a production environment without having to identify a reference sequence a priori. It would also be advantageous to be able to automatically identify characteristic sequences within multiple traces of executions of an application at different times.

These advantages, and others, can be realized by a system and method that determines correlations within multi-tier communications based on repeated iterations of a user transaction. Content-based correlations are determined by encoding the content using a finite alphabet, then searching for similar sequences among the multiple traces. By encoding the content to a finite alphabet, common pattern matching techniques may be used, including, for example, DNA alignment algorithms. To facilitate alignment of the traces, structural and/or semantic breakpoints are defined, and the encoding in each trace is synchronized to these breakpoints. To facilitate efficient processing, a hierarchy of causality among tier-pairs is identified, and messages at lower levels are ranked and temporally filtered, based on activity intervals at higher levels of the hierarchy.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is explained in further detail, and by way of example, with reference to the accompanying drawings wherein:

FIGS. 1A-1B illustrate an example of finding a reference sequence within a set of multiple-tier traces.

FIG. 2 illustrates an example flow diagram for finding repeated message content in a set of message traces in accordance with this invention.

FIG. 3 illustrates an example mapping of a segment of a message into a limited alphabet set.

FIG. 4 illustrates an example flow diagram for aligning a pair of sequences.

FIG. 5 illustrates an example determination of a longest common sequence (LCS) of k-tuples within a pair of sequences.

FIG. 6 illustrates an example block diagram of a system for identifying repeated message content in a set of message traces in accordance with this invention.

FIG. 7 illustrates an example set of communications among a variety of tier-pairs.

FIG. 8 illustrates an example of filtering and ranking messages and activity intervals based on a causal hierarchy among tier-pairs.

Throughout the drawings, the same reference numerals indicate similar or corresponding features or functions. The drawings are included for illustrative purposes and are not intended to limit the scope of the invention.

DETAILED DESCRIPTION

In the following description, for purposes of explanation rather than limitation, specific details are set forth such as the particular architecture, interfaces, techniques, etc., in order to provide a thorough understanding of the concepts of the invention. However, it will be apparent to those skilled in the art that the present invention may be practiced in other embodiments, which depart from these specific details. In like manner, the text of this description is directed to the example embodiments as illustrated in the Figures, and is not intended to limit the claimed invention beyond the limits expressly included in the claims. For purposes of simplicity and clarity, detailed descriptions of well-known devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.

FIG. 2 illustrates an example flow diagram for finding repeated message content in a set of message traces in accordance with this invention. The invention is premised on the assumption that if the same transaction is executed at different times, a number of messages occurring during each execution of the transaction will contain similar content, particularly if the transaction is executed by the same person, or a person in a similar position or context. At 210, traces from the tier pairs that are likely to be used by the transaction are captured during each episode of execution of the transaction. In this example, the traces from three separate execution episodes are captured, although one of skill in the art will recognize that any number of episodes greater than one may be captured. Capturing the traces from at least three episodes provides a higher degree of confidence that the identified messages within the traces are actually associated with transactions associated with the application.

The traces are segregated by tier-pair and direction, so that messages traveling from a given source tier to a given destination tier in each episode can be compared with each other to identify messages having similar content in the three episodes. In a preferred embodiment, the traces of the different tier-pairs are synchronized to a common timing base, so that a time ordering of occurrences at each tier pair can be established. U.S. Pat. No. 7,570,669, “AUGMENTATION TO A METHOD FOR MERGING/SYNCHRONIZING PACKET TRACES, INCLUDING MANUAL SYNCHRONIZATION”, issued 4 Aug. 2009 to Patrick J. Malloy and Antoine D. Dunn, discloses determining a common time base among nodes in a network by iteratively propagating timing constraints among the nodes, and determining a time-shift to apply to the time base of each node that conforms to these constraints, and is incorporated by reference herein.

At 220, the traces may be filtered. With a common timing base, the extent of searching for common message content may be controlled, thereby improving the efficiency of the comparison process. As illustrated in FIG. 7, for example, communications at the ‘first’ tier pair A-B between a particular client/user at tier A and a server at tier B can generally be easily identified. In general, messages that occur on the next ‘lower’ tier pair B-C before an initial communication 710 from the client to the server can be ignored, because they could not be in response to the client's request to the server. Such “ignorable” messages 720 are identified in FIG. 7 using dashed lines. In like manner, if a message 730 at the uppermost tier-pair A-B can be identified as a termination of a particular transaction, or activity interval 750, between the client and server, messages 740 on the tier pair B-C after this termination message 730 may also be ignored. Other techniques for identifying intervals of time that can be ignored may also be used.
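By way of illustration only, the temporal filtering at 220 might be sketched as follows. The record layout (a timestamp in milliseconds on the common timing base, paired with a payload) and the function name `filter_by_upper_interval` are assumptions made for this sketch and do not appear in the figures.

```python
from typing import List, Tuple

# Hypothetical record layout: (timestamp in ms on the common time base, payload).
Message = Tuple[int, bytes]


def filter_by_upper_interval(lower_tier: List[Message],
                             start_ms: int, end_ms: int) -> List[Message]:
    """Keep only lower-tier messages that fall inside the upper-tier interval.

    start_ms plays the role of the initial communication (710) and end_ms the
    role of the termination message (730) on the upper tier-pair.
    """
    return [m for m in lower_tier if start_ms <= m[0] <= end_ms]


# Messages on tier-pair B-C; the first and last fall outside the A-B interval.
b_c = [(50, b"early"), (120, b"query"), (180, b"rows"), (400, b"late")]
kept = filter_by_upper_interval(b_c, start_ms=100, end_ms=300)
print(kept)
```

A hard cut-off is the simplest choice; the scoring described in the paragraphs that follow treats messages near the interval boundaries more gradually.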

In a preferred embodiment of this invention, messages or transactions at each tier-pair are filtered and rank-scored based on parameters associated with messages at other tier-pairs. In particular, a ‘hierarchy’ of tier-pairs is defined relative to an execution of a particular application, and the messages at tier-pairs at lower levels of the hierarchy are filtered and ranked based on the parameters associated with messages or activity intervals at tier-pairs at higher levels of the hierarchy.

FIG. 8 illustrates an example of filtering and ranking messages and activity intervals at lower levels of a hierarchy of tiers based on activity intervals at higher levels of the hierarchy. In this example, a simple hierarchy A-B-C-D-E is illustrated, but one of skill in the art will recognize that other tier-pair arrangements may exist, such as tier-pair B-D for communications directly between tier B and D, without having to pass through intermediate node C. For the purposes of this invention, the term hierarchy is used in a general sense, commonly illustrated as a directed acyclic graph that indicates an assumed or potential causal relationship, or chain of communication, among tier-pairs. That is, in FIG. 8, for example, it is assumed that messages from tier A to tier B may cause messages to be sent from tier B to tier C (or other tier), and thus tier-pair B-C is at a lower level of the hierarchy with regard to messages from tier A to tier B.

Within each tier-pair, messages are assessed to identify discrete activity intervals. For example, an activity interval may be identified by determining a maximal set of consecutive request-response pairs that occur in close time-proximity to each other. That is, if there is a long gap of time between one request-response pair and another, the second request-response pair is likely to be the start of a new activity interval. Note that this partitioning of messages into activity intervals is primarily a means of reducing the amount of data that needs to be processed, by grouping multiple messages into such activity intervals, and need not be precise. If a short ‘inactivity interval’ 850 is used, more groups will be formed; if a long inactivity interval 850 is used, unrelated activities may be grouped into a single activity. In a preferred embodiment, the system may present the results of this partitioning, and allow the user to adjust the duration of the inactivity interval 850. Similarly, the system may use heuristic and other techniques to automate the determination of a suitable inactivity interval 850.
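The gap-based partitioning described above may be sketched as follows. This is a simplified illustration that groups individual message timestamps rather than request-response pairs; the function name and the `max_gap_ms` parameter (standing in for the inactivity interval 850) are assumptions of the sketch.

```python
from typing import Iterable, List, Tuple


def split_into_activity_intervals(timestamps_ms: Iterable[int],
                                  max_gap_ms: int) -> List[Tuple[int, int]]:
    """Group timestamps into (start, end) activity intervals.

    A new interval begins whenever the gap since the previous message
    exceeds max_gap_ms (the assumed 'inactivity interval').
    """
    intervals: List[List[int]] = []
    for t in sorted(timestamps_ms):
        if intervals and t - intervals[-1][1] <= max_gap_ms:
            intervals[-1][1] = t          # extend the current interval
        else:
            intervals.append([t, t])      # long gap: start a new interval
    return [(start, end) for start, end in intervals]


times = [0, 200, 500, 3000, 3100]
print(split_into_activity_intervals(times, max_gap_ms=1000))
print(split_into_activity_intervals(times, max_gap_ms=250))
```

Running the sketch with the two gap values shows the tradeoff noted above: the shorter inactivity interval yields three groups, the longer one yields two.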

After identifying the activity intervals at each tier-pair, each activity interval is scored based on its relationship to the activity intervals of its upper tier-pairs. Illustrated in FIG. 8 are four types of regions 810, 820, 830, 840 that may categorize such activity intervals at lower-level tier-pairs. If a lower-level activity interval occurs in a region 810 that is well within the activity interval 750 of its parent, there is no apparent reason to assume that the activity intervals in this region 810 are not related to messages in the activity interval 750. However, activity intervals in regions 820, 830 near the beginning or end of the activity interval 750 may be less likely to have been associated with the messages of activity interval 750, due to the imprecise nature of the definition of activity intervals, particularly at lower levels of the hierarchy. Activity intervals that are within region 840 and well outside the activity interval 750 are very unlikely to be associated with the upper level activity interval 750.

In an embodiment of this invention, the activity intervals of lower-level tier-pairs are scored based on a variety of criteria, including, for example, determining an amount of overlap between the activity intervals. A lower level activity interval that is totally contained within the upper-level activity interval will score highly; one that is only partially contained within the upper-level activity interval will score lower. Additional scoring techniques may also include granting ‘bonus’ points to activity intervals at the lower-level that begin very near to the beginning of the upper-level activity interval, as well as to lower-level activity intervals that end very near to the end of the upper-level activity interval. In like manner, ‘penalty’ points may be assessed against lower-level activity intervals that start before the start of their upper-level activity interval, or against lower-level activity intervals that end after the end of their upper-level activity interval.
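One possible form of such a scoring rule is sketched below. The base score is the fraction of the lower-level interval that overlaps the upper-level interval; the particular bonus and penalty values, the `near` tolerance, and the function name are all assumptions chosen for illustration, not values taken from this description.

```python
from typing import Tuple

Interval = Tuple[int, int]  # (start, end) on the common time base


def score_interval(child: Interval, parent: Interval, near: int = 5,
                   bonus: float = 0.25, penalty: float = 0.5) -> float:
    """Score a lower-level interval against its upper-level interval."""
    c_start, c_end = child
    p_start, p_end = parent
    length = c_end - c_start
    overlap = max(0, min(c_end, p_end) - max(c_start, p_start))
    if length > 0:
        score = overlap / length          # 1.0 when totally contained
    else:
        score = 1.0 if p_start <= c_start <= p_end else 0.0
    # Start edge: penalize an early start, reward a start near the parent's start.
    if c_start < p_start:
        score -= penalty
    elif c_start - p_start <= near:
        score += bonus
    # End edge: penalize a late end, reward an end near the parent's end.
    if c_end > p_end:
        score -= penalty
    elif p_end - c_end <= near:
        score += bonus
    return score


parent = (0, 100)
for child in [(10, 50), (2, 98), (-20, 20), (150, 200)]:
    print(child, score_interval(child, parent))
```

With these assumed weights, an interval that closely spans its parent scores highest, and one lying entirely after the parent scores below zero.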

Other scoring and ranking schemes may also be used. For example, the duration of a given activity interval may be attributable to activities at levels at or below the particular tier-pair. That is, time is either being consumed by processing at the particular tier, or processing and communication at tiers below the particular tier. Accordingly, a lower level activity interval, or set of activity intervals, that “fills” the upper level activity interval, thereby accounting for the time of the upper level activity interval, may be scored higher than an activity interval or set of intervals that do not account for the entire duration of the upper level activity interval.

After scoring all of the activity intervals, the resultant scores are used to determine whether the content of the messages contained within each activity interval is to be subsequently processed. In a straightforward embodiment of this feature, only messages in activity intervals that score higher than a given minimum score are considered for subsequent analysis. In another embodiment, the scores are used to rank the activity intervals, and only messages in the “Top-N” activity intervals at each level are considered for subsequent analysis.
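Both selection schemes can be expressed in a few lines. In this sketch the interval labels, the `select_intervals` name, and its parameters are hypothetical; either or both criteria may be supplied.

```python
from typing import List, Optional, Tuple


def select_intervals(scored: List[Tuple[str, float]],
                     min_score: Optional[float] = None,
                     top_n: Optional[int] = None) -> List[Tuple[str, float]]:
    """Keep intervals scoring above min_score, then the top_n by score."""
    kept = [(name, s) for name, s in scored
            if min_score is None or s > min_score]
    kept.sort(key=lambda item: item[1], reverse=True)  # highest score first
    return kept if top_n is None else kept[:top_n]


scored = [("i1", 1.5), ("i2", 0.0), ("i3", 1.0), ("i4", -0.5)]
print(select_intervals(scored, min_score=0.0))
print(select_intervals(scored, top_n=3))
```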

Having identified messages that may be related to the transaction being assessed, the content of these messages in multiple episodes of the transaction is assessed to identify substantially similar messages in all of these episodes, as detailed further below with regard to the flow diagram of FIG. 2.

As noted above, except in fairly static situations, the content of messages associated with repeated executions of an application will rarely be ‘identical’, and therefore a search for identical messages within each episode is not likely to be successful for typical executions of an application of moderate complexity. Therefore, in accordance with a feature of this invention, some or all of the content of each message is encoded using a finite-alphabet, at 230, and the search for similar messages is based on a comparison of these finite-alphabet encodings of each message in each episode, at 240.

The use of a finite-alphabet encoding in lieu of the actual message content provides potential advantages with regard to the time and complexity required to compare the content of messages, as well as with regard to finding ‘similar’ but not ‘identical’ messages. In a preferred embodiment of this invention, multiple bytes of a message are encoded into a single ‘letter’ of the finite-alphabet. In a text message, for example, words will be encoded using substantially fewer letters, and the occurrence of the same word in messages in multiple episodes of an application can be identified as the occurrence of these fewer letters in the encoded versions of the messages. In like manner, differences in the content of the messages may be identified by differences in the fewer letters. In non-text messages, a similar efficiency is achieved by encoding multi-byte sequences into a single letter for comparison with similarly encoded multi-byte sequences.

At 240, the encoded messages in two of the episodes are compared to find matching sequences of encoded letters of the finite-alphabet text to identify one or more longest common sequences (LCS) within the encoded messages. In this example, the encoded messages of the second and third episodes are compared, but one of skill in the art will recognize that any pair of episodes may be compared. Any of a number of techniques may be used to perform the comparison and determine the LCS(s), as detailed further below, including those commonly used to compare DNA sequences.

At 250, the process of 230-240 is repeated, using the encoded messages of the remaining episode (in this example, the first episode) and the determined LCS (in this example, the LCS of the second and third episodes), to determine a longest common sequence (LCS) corresponding to the combination of the encodings of the communications that occurred in each of the three episodes.
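The two-stage comparison at 240 and 250 can be sketched with a textbook dynamic-programming longest-common-subsequence routine. This routine is a stand-in chosen for brevity, not necessarily the alignment technique of the preferred embodiment, and the episode strings below are made-up encodings.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (textbook DP)."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of the suffixes a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one optimal subsequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)


def common_across_three(ep1: str, ep2: str, ep3: str) -> str:
    """Compare episodes 2 and 3 first (240), then the result with episode 1 (250)."""
    return lcs(ep1, lcs(ep2, ep3))


print(common_across_three("xfbdq", "fbzd", "afbd"))  # fbd
```

Note that chaining pairwise results in this way follows the order of steps 240 and 250; it is not guaranteed to find the longest sequence common to all three episodes in every case.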

At 260, and at other stages of the example process, the determined LCS may optionally be analyzed/filtered to accommodate false negatives in the alignment process and/or reduce the effects of false positives in this process. For example, if the size of the limited-alphabet set is small, the likelihood of different original sequences being encoded into the same set of encoded letters is relatively higher than in a larger sized limited-alphabet set.

Having determined a sequence that appears to be repeated in the encoded messages of the three episodes of an application, the corresponding messages in at least one of the episodes (e.g. episode 1) are identified, at 270. This identification of messages of a transaction corresponding to the execution of the application may subsequently be provided to other analysis systems to perform any of a variety of tasks, including determining timing and delay characteristics associated with the transaction, determining changes in either the application or the network that may improve these characteristics, and so on. Copending U.S. patent application Ser. No. 12/060,271, “NETWORK DELAY ANALYSIS INCLUDING PARALLEL DELAY EFFECTS”, filed 1 Apr. 2008 for NIEMCZYK et al., incorporated by reference herein, for example, discloses a variety of techniques for identifying dependencies among messages in a multi-tier environment, and subsequently identifying possible improvements to the network taking these dependencies into account.

These identified messages may also be provided as a ‘reference sequence’ in an embodiment of the aforementioned “CORRELATING PACKETS” patent (U.S. Pat. No. 7,729,256) for analyses of subsequent executions of the application. As noted above, different users of an application may often have different characteristic sequences, and this invention could enable the creation of different reference sequences for each particular user or class of users. In like manner, the above described technique of identifying similar messages based on a limited-alphabet encoding of message content may be used in an embodiment of the “CORRELATING PACKETS” patent for providing a measure of correlation between individual packets based on message content.

FIG. 3 illustrates an example encoding of a message 310 into a limited-alphabet message 330. In FIG. 3, the message 310 is illustrated in two forms, a text form 310 and an equivalent hexadecimal form 310′, each two-digit hexadecimal number in message 310′ corresponding to an ASCII encoding of the characters in the message 310. For example, the first word (“GET”) in message 310 corresponds to the first three ASCII bytes (47, 45, 54) in message 310′.

In accordance with a feature of this invention, ‘breakpoints’ may be defined to facilitate aligning of the content among the messages of the multiple episodes. Because the content of the message 310 is being encoded into a limited-alphabet text, an offset of as little as one byte in the original message of the two episodes being compared will likely produce a completely different encoding of these two messages. By using definable breakpoints, the impact of such offsets can be limited to the interval between breakpoints. The breakpoints may include both ‘structural’ breakpoints and ‘semantic’ breakpoints. A structural breakpoint may be, for example, the end of each packet, or an imposed breakpoint after a given number of bytes. A semantic breakpoint, on the other hand, may be a commonly occurring character or symbol within the expected content, such as a “space” character in a text document, or an “end of record” character in a database file.

In the example of FIG. 3, the occurrence of a “space” (ASCII “20”) in the text of the message 310 is defined as a breakpoint; in this manner, the encoding of the message will generally correspond to an encoding of each individual word. One of skill in the art will recognize that alternative or additional breakpoints may also be defined. For example, in a text file, the end of a line and start of a new line is usually encoded as a “Carriage Return” (“CR”, ASCII “0D”) followed by a “Line Feed” (“LF”, ASCII “0A”), or vice versa. One could define any or all of these characters, or sequence of characters, as breakpoints to assure that the start of each new line re-synchronizes the comparison process. In like manner, in a non-text file, such as a non-text database file, the symbols used to indicate the start and/or end of each data record may be used as breakpoints.

As noted above, a preferred encoding of the original message encodes a plurality of bytes in the original message into a single letter of the limited-alphabet set. For ease of reference, the term ‘block’ is used to identify the plurality of bytes that are encoded into a single letter. The block-size may be determined based on any number of factors. A large block-size will result in a high degree of ‘compression’ of the original message into a much smaller encoded message, thereby reducing the number of letters that must be compared between encoded messages of the different episodes. However, the likelihood of two relatively long sequences of bytes in the two messages being identical to each other (thereby producing the same encoded letter) is reduced, compared to a smaller block-size. In general, the nature of the messages associated with a given transaction of an application will determine the appropriate balance between reducing the size of the messages to be compared and improving the likelihood of successful matches. If the nature of the messages associated with a given transaction is text-based, a block size of four to eight may be preferred, because the average size of a word is generally between four and eight characters. If the messages are non-text database records, on the other hand, the average size of the record-header, or record-descriptor, may be used to determine an appropriate block size.

At 320, the partitioning of the message 310 based on a five-character block size and the use of a space character (“20”) as a breakpoint in the message 310′ is illustrated. Upon each occurrence of a space character, a new five-byte block is started. As illustrated at 320, the word “GET” (47 45 54) forms a first block, then a new block is started when the space (20) after “GET” occurs. A subsequent new block is started when the space (20) occurs after the “/” (2F) character. The next set of characters “HTTP/1.1”, followed by an end of line (CR-LF; 0D 0A) and the word “Accept” does not contain a space (20), and thus forms three complete blocks and a partially filled block corresponding to the last three letters (“ept”) before the space.

As illustrated in the example of 320, each byte in the original message 310′ is included within the blocks, with the breakpoint character (20) appearing at the start of a new block. However, alternative schemes may be used to partition the content of the original message. For example, the character(s) used as breakpoints could be placed in the prior block, rather than at the start of the new block, or could be eliminated completely. In like manner, commonly occurring “noise” words, such as articles and pronouns may be omitted to avoid different messages appearing to be similar. In like manner, if it is known that the message 310 is a text file, all of the characters may be converted to either upper-case or lower-case, and punctuation marks may be omitted. These and other techniques for improving the efficiency of the encoding and comparison process will be evident to one of skill in the art in view of this disclosure.
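The partitioning scheme illustrated at 320, with the breakpoint character placed at the start of the new block, may be sketched as follows. The function name `partition` and its parameters are assumptions of this sketch, and only the first line of an example request is used as input.

```python
from typing import List


def partition(message: bytes, block_size: int = 5,
              breakpoints: bytes = b" ") -> List[bytes]:
    """Split message into blocks of at most block_size bytes.

    A new block is started when the current block is full, or when a
    breakpoint byte occurs; the breakpoint byte begins the new block.
    """
    blocks: List[bytes] = []
    current = bytearray()
    for byte in message:
        if current and (byte in breakpoints or len(current) == block_size):
            blocks.append(bytes(current))
            current = bytearray()
        current.append(byte)
    if current:
        blocks.append(bytes(current))   # trailing, possibly incomplete, block
    return blocks


blocks = partition(b"GET / HTTP/1.1\r\n")
print(blocks)
```

For this input the third and fourth blocks are the byte sequences 20 48 54 54 50 and 2F 31 2E 31 0D. Passing `breakpoints=b" \r\n"` would additionally re-synchronize the blocks at each line ending.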

The partitioned blocks are subsequently encoded into letters of a limited-alphabet set, using any number of encoding techniques. Typically, a hash function having an output range that corresponds to the size of the alphabet may be used; as each hash value is produced, a corresponding letter, or equivalently, the hash value itself, is stored as the encoded message 330. The particular hash function used is immaterial, but one that is sensitive to the actual sequence of bytes in the block is generally preferred, so that, for example, “abcde” does not necessarily produce the same letter as “badec”. In like manner, a hash function that provides a somewhat uniform distribution of encoded letters when the original message is somewhat typical of an expected distribution of sequences of bytes is also preferred. One of skill in the art will recognize that hash functions having particular output characteristics relative to the characteristics of their input variables are common in the art.

Incomplete blocks may be encoded or omitted, typically depending upon the expected form or content of the original messages and/or depending upon the degree of incompletion. For example, the rule may be that all blocks are encoded, an incomplete block that is more than half full may be encoded, no incomplete blocks are encoded, etc. Depending upon the encoding process (hashing function) used, incomplete blocks may need to be “filled”, using, for example, spaces to complete the block. The particular rules for dealing with incomplete blocks are somewhat immaterial, provided, of course, that the same rules are applied for each episode's messages, and provided that the subsequent matching process does not impose constraints with regards to ‘gaps’ in sequences.

In the example of FIG. 3, incomplete blocks are not encoded, as illustrated by the “.” in the corresponding block encoding area. In this example, a ten-letter (a-j) alphabet set is used, and the third block (20 48 54 54 50) is hashed to a value of 06, corresponding to the letter “f” at the third block area of 330. In like manner, the fourth block (2F 31 2E 31 0D) is hashed to a value of 02, corresponding to the letter “b”. In this example, the encoding of the message 310, corresponding to a message in one episode, produces the sequence “fbdddfehgcidd”. The subsequent sequence matching process will use this sequence to determine whether an encoded message in another episode includes a similar sequence, as detailed further below. In this example, a comparison of an eighty character message 310 is reduced in complexity to a comparison of a thirteen character encoded sequence 330.
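A minimal sketch of the block encoding follows, assuming CRC-32 (via Python's standard `zlib.crc32`) reduced modulo the alphabet size as the hash. This hash is an arbitrary choice for illustration; the hash used to produce FIG. 3 is not specified, so the letters produced here will generally differ from those shown at 330. Incomplete blocks are marked with “.” rather than encoded, as in the example above.

```python
import zlib
from typing import List

ALPHABET = "abcdefghij"  # ten-letter alphabet, as in the example of FIG. 3


def encode_blocks(blocks: List[bytes], block_size: int = 5,
                  alphabet: str = ALPHABET, skip_marker: str = ".") -> str:
    """Map each complete block to one letter; mark incomplete blocks."""
    letters = []
    for block in blocks:
        if len(block) < block_size:
            letters.append(skip_marker)            # incomplete: not encoded
        else:
            value = zlib.crc32(block) % len(alphabet)
            letters.append(alphabet[value])        # order-sensitive hash
    return "".join(letters)


blocks = [b"GET", b" /", b" HTTP", b"/1.1\r", b"\n"]
encoded = encode_blocks(blocks)
print(encoded)
```

Python's built-in `hash()` is deliberately avoided here because it is randomized per process for bytes, which would make encodings from separately processed episodes incomparable.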

One of skill in the art will recognize that the block partitioning and encoding into a single letter may be provided as a single function, such that the separate representation illustrated in 320 may never actually be produced. Similarly, one of skill in the art may also recognize that a fixed block size need not be used. For example, the beginning of each line, or each data record may be partitioned into a block that captures a descriptor (such as a “GET” command, or a data-type) regardless of size, with the remainder of the line being partitioned into blocks based on other criteria, such as the aforementioned fixed sized blocks. The particular technique used to partition the original message is somewhat immaterial, provided that the same technique is used for messages in each of the episodes being compared, and provided that the encoding process is compatible with the blocks produced. In like manner, different blocking and/or encoding techniques may be used for messages at different tier-pairs, or messages between particular source and destination nodes.

FIG. 4 illustrates an example flow diagram for aligning and comparingsequences in the encoded messages of two episodes of execution of anapplication, with reference to the table of FIG. 5.

Even though the above detailed encoding of the original messages significantly reduces the amount of data that needs to be compared, further efficiencies may be required or desired. In accordance with a feature of this invention, instead of comparing each letter in each encoded message of an episode with each letter in each encoded message of another episode, sequences of encoded letters (“k-tuples”) are compared. For example, given the above sequence “fbdddfehgcidd”, instead of finding a first “f” in the other episode's encoded sequence, followed by finding a subsequent “b”, followed by a subsequent “d”, in a preferred embodiment, the comparison process may initially attempt to find a 3-tuple “fbd” (first three letters) in the other episode's encoded sequence, followed by a subsequent 3-tuple “ddf” (second set of three letters). Alternatively, the second 3-tuple could be “bdd” (second through fourth letters), which would not be as efficient as searching for the next exclusive set of three letters, but would likely increase the chance of finding successful matches. Although this second alternative performs a comparison for each next letter, the criterion for matching is the occurrence of the same three-letter sequence in the other episode's message, significantly reducing “false matches”, as compared to the matching of single characters.
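The two ways of forming k-tuples described above (exclusive sets of letters, or a new tuple starting at each next letter) can be illustrated with a short sketch. The function name k_tuples and its overlapping parameter are hypothetical names chosen for this illustration.

```python
def k_tuples(encoded: str, k: int, overlapping: bool = True) -> list[str]:
    """Split an encoded message into k-letter tuples.

    overlapping=True starts a tuple at every letter ("fbd", "bdd", ...);
    overlapping=False uses exclusive sets of k letters ("fbd", "ddf", ...).
    A trailing fragment shorter than k letters is dropped.
    """
    step = 1 if overlapping else k
    return [encoded[i:i + k] for i in range(0, len(encoded) - k + 1, step)]


exclusive = k_tuples("fbdddfehgcidd", 3, overlapping=False)
# ['fbd', 'ddf', 'ehg', 'cid']
sliding = k_tuples("fbdddfehgcidd", 3)
# ['fbd', 'bdd', 'ddd', 'ddf', 'dfe', 'feh', 'ehg', 'hgc', 'gci', 'cid', 'idd']
```

The exclusive form yields roughly 1/k as many tuples to compare, while the sliding form still finds a match when the corresponding content is shifted by a number of blocks that is not a multiple of k.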

As with the choice of block size, the choice of the size of the k-tuple is generally a tradeoff between efficiency and likelihood of successful matches, the likelihood of successful matches being dependent upon the nature of the messages being compared, as well as the size of the alphabet. In a general case, “k” is rarely greater than 8. The search-space (i.e. the span of messages being compared) may also affect the choice of “k”; if the search-space is small, the value of k may be lowered without significantly affecting performance. In a preferred embodiment of this invention, if a search with a given value of k fails to identify any “significant” correlations between the encoded messages of the episodes being compared, the value of k is reduced and the process is repeated.

At 410 of FIG. 4, the k-tuple sequences of two episodes (2 and 3) are compared, and the coincidences are identified, as illustrated by “X”s in FIG. 5. As illustrated in FIG. 5, the first k-tuple of the encoded message of episode 2 does not match the first k-tuple of episode 3, and the corresponding space 501 is not marked. The fourth k-tuple of episode 2 (the fourth column of FIG. 5) is found to match the second k-tuple of episode 3 (the second row of FIG. 5), and the corresponding space 502 is marked. In like manner, the second k-tuple of episode 2 is found to match the third, fourth, sixth, and ninth k-tuples of episode 3, and the corresponding spaces 503-506 are marked.

The diagonals of FIG. 5 correspond to a sequential series of k-tuples between the episodes. A series of markings along a diagonal indicates a continuous series of coincidences of k-tuple values between the episodes. That is, the series of markings that form an “island” 510 along the diagonal indicate that the second through fifth k-tuples of episode 2 matched the fourth through seventh k-tuples of episode 3, and the island 520 indicates that the seventh through ninth k-tuples of episode 2 matched the tenth through twelfth k-tuples of episode 3. Such series of coincidences between the encoded messages of episodes 2 and 3 indicate a high likelihood that the original messages were similar. At 420, the k-tuples along each diagonal are identified and consolidated into such islands.
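The marking of coincidences at 410 and the consolidation into islands at 420 might be sketched as follows. This is an illustrative sketch only: the function names, the zero-based indexing, and the representation of an island as a (diagonal, start, length) triple are assumptions, with the diagonal identified by the difference between the position in the second sequence and the position in the first.

```python
from collections import defaultdict


def coincidences(tuples_a, tuples_b):
    """Return every (i, j) such that tuples_a[i] == tuples_b[j]."""
    positions = defaultdict(list)
    for j, value in enumerate(tuples_b):
        positions[value].append(j)
    return [(i, j) for i, value in enumerate(tuples_a)
            for j in positions.get(value, [])]


def islands(pairs, min_length=1):
    """Consolidate coincidences into runs along each diagonal (j - i constant).

    Each island is reported as (diagonal, start index in the first sequence, length).
    """
    by_diagonal = defaultdict(list)
    for i, j in pairs:
        by_diagonal[j - i].append(i)
    found = []
    for diagonal, rows in by_diagonal.items():
        rows.sort()
        start = prev = rows[0]
        for i in rows[1:] + [None]:      # None flushes the final run
            if i is not None and i == prev + 1:
                prev = i                 # run continues along the diagonal
                continue
            if prev - start + 1 >= min_length:
                found.append((diagonal, start, prev - start + 1))
            if i is not None:
                start = prev = i         # begin a new run
    return sorted(found)
```

With this representation, an island in which the second through fifth k-tuples of one episode match the fourth through seventh k-tuples of the other would be reported as (2, 1, 4) in zero-based terms.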

At 430, ‘significant’ diagonals are identified, and insignificant diagonals are removed, to improve the efficiency of subsequent processes. Any number of techniques may be used to distinguish between significant and insignificant diagonals. In an example embodiment of this invention, the number of coincident k-tuples along each diagonal is counted, and the average and standard deviation among these counts are noted. Diagonals having a number of coincident k-tuples that is greater than one standard deviation above the average are considered to be significant. Additionally, diagonals to the left and right of significant diagonals, within a given window width, are also considered to be significant. The window width may be user selectable, and may be dependent upon the number of k-tuples being compared; in an example embodiment, a default window width of 25 is used.
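One possible reading of the selection at 430 is sketched below. The text does not specify whether diagonals with no coincidences contribute to the average, which form of standard deviation is used, or whether the window width is counted on each side; this sketch assumes that all diagonals of the matrix are included, that the population standard deviation is used, and that the window extends the given number of diagonals to each side. The function name and parameters are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def significant_diagonals(pairs, n_a, n_b, window=25):
    """Return the set of diagonals (j - i) retained for further processing.

    pairs: coincidences as (i, j) index pairs
    n_a, n_b: number of k-tuples in the first and second encoded message
    """
    all_diagonals = range(-(n_a - 1), n_b)
    counts = Counter(j - i for i, j in pairs)
    values = [counts.get(d, 0) for d in all_diagonals]
    if not values:
        return set()
    # Significant: more than one standard deviation above the average count.
    threshold = mean(values) + pstdev(values)
    core = [d for d in all_diagonals if counts.get(d, 0) > threshold]
    keep = set()
    for d in core:
        # Neighbouring diagonals within the window are also retained.
        for neighbour in range(d - window, d + window + 1):
            if neighbour in all_diagonals:
                keep.add(neighbour)
    return keep
```

Coincidences lying on diagonals outside the returned set would then be discarded before the later steps.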

One of skill in the art will recognize that alternative techniques may be used to distinguish runs of coincidences in k-tuples of encoded messages of a pair of episodes of an application. For example, instead of assessing each diagonal independently, one may assess groups of diagonals to identify groups that exhibit a higher-than-average number of coincidences. In this manner, ‘slips’ or ‘gaps’ between sequences of coincidences in the episodes may be better accommodated. Similarly, diagonals in the upper-right and lower-left of the coincidence matrix may be omitted when their length is determined to be too short to allow for a meaningful number of coincidences. That is, comparing a long sequence of k-tuples that occur at the beginning of one episode with a much smaller number of k-tuples that occur at the end of the other episode can generally be avoided. Other techniques for reducing the number of k-tuples that need to be assessed in the subsequent processes will be evident to one of skill in the art in view of this disclosure.

After eliminating the insignificant diagonals, the remaining coincidences are assessed to determine sequences of coincidences that indicate that similar original messages are present in each episode. If there are no significant diagonals, the encoded messages are determined to be dissimilar, and a next pair of encoded messages is assessed.

Any number of a variety of techniques may be used to identify similar encoded messages. In a relatively simple embodiment of this invention, heuristics may be used to determine that a message in one episode appears to be similar to a message in the other episode. For example, a count of k-tuple coincidences within coincidence islands of a given minimum size may be accumulated, and if this count is above a given threshold value, the messages may be determined to be sufficiently similar to each other.
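The heuristic just described reduces to a few lines. In this sketch the islands are assumed to be available as (diagonal, start, length) triples, and the function name and the default values of min_island and threshold are hypothetical; the text does not give particular values.

```python
def appear_similar(islands, min_island=3, threshold=6):
    """Accumulate coincidences within islands of at least min_island k-tuples
    and report whether the accumulated count is above the threshold."""
    accumulated = sum(length for _, _, length in islands if length >= min_island)
    return accumulated > threshold


# Islands of four and three coincident k-tuples, plus one isolated coincidence
# that is too small to be counted.
example = [(2, 1, 4), (3, 6, 3), (-4, 7, 1)]
result = appear_similar(example)  # accumulated count is 7, which exceeds 6
```

Raising min_island makes the test less sensitive to isolated coincidences that arise from the many-to-one encoding.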

In a preferred, more robust embodiment, a longest common sequence (LCS) of coincident k-tuples within the encoded messages of the two episodes is determined, at 440. Any number of existing processes may be used to determine the LCS, although a sparse dynamic programming algorithm would generally be the most efficient. Examples of such algorithms include Hirschberg, Needleman-Wunsch, and Smith-Waterman.
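For illustration, the determination at 440 can be sketched with the textbook quadratic dynamic-programming solution to the longest common subsequence problem, applied to two sequences of k-tuples (or, equivalently, to two strings of letters). This is an assumption about what the LCS step computes, and it is not the sparse variant that the text suggests would be more efficient; the function name is hypothetical.

```python
def longest_common_subsequence(a, b):
    """Return aligned index pairs (i, j), increasing in both i and j,
    such that a[i] == b[j] and the number of pairs is maximal."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one optimal alignment.
    pairs = []
    i = j = 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs
```

The index pairs, rather than only the common letters, are returned so that the matching positions can later be traced back to the original messages in each episode.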

Initially, with a relatively large value of “k”, the pattern of coincident k-tuples is likely to include “gaps” between the coincidence islands, and the determination of an LCS will be incomplete. To further complete the LCS solution, the value of “k” is reduced, and the process 410-440 is repeated for each of the gaps. When no gaps remain, or the value of k cannot be reduced below 1, this iterative process 450 is terminated, and the determined LCS solution is recorded.
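The iterative refinement at 450 might be sketched as a recursion: matches found with k-tuples anchor the alignment, and each gap between anchors is searched again with a smaller k. This is a simplified illustration under stated assumptions: it omits the diagonal filtering at 430, uses a plain dynamic-programming LCS on overlapping k-tuples, resolves overlapping tuple matches greedily, and therefore yields a valid common subsequence that is not guaranteed to be the longest. The function names are hypothetical.

```python
def lcs_pairs(a, b):
    """Index pairs of a longest common subsequence (textbook dynamic programming)."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def align(a: str, b: str, k: int):
    """Return increasing (i, j) letter positions with a[i] == b[j], found by
    matching k-tuples first and then filling the gaps with smaller k."""
    if k < 1 or not a or not b:
        return []
    tuples_a = [a[i:i + k] for i in range(len(a) - k + 1)]
    tuples_b = [b[j:j + k] for j in range(len(b) - k + 1)]
    # Expand matched k-tuples to letter positions, skipping letters that an
    # earlier (overlapping) tuple match has already claimed.
    anchors, last_i, last_j = [], -1, -1
    for i, j in lcs_pairs(tuples_a, tuples_b):
        for t in range(k):
            if i + t > last_i and j + t > last_j:
                last_i, last_j = i + t, j + t
                anchors.append((last_i, last_j))
    # Fill each gap between consecutive anchors using a smaller k.
    end = (len(a), len(b))
    result, prev_i, prev_j = [], -1, -1
    for next_i, next_j in anchors + [end]:
        gap = align(a[prev_i + 1:next_i], b[prev_j + 1:next_j], k - 1)
        result.extend((prev_i + 1 + gi, prev_j + 1 + gj) for gi, gj in gap)
        if (next_i, next_j) != end:
            result.append((next_i, next_j))
        prev_i, prev_j = next_i, next_j
    return result
```

For example, aligning “fbdddfehgcidd” with a copy in which the fourth letter is replaced by “x” anchors the first three letters and the last nine with k = 3, and the remaining one-letter gap yields no further match at k = 2 or k = 1.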

Because the encoding to a limited alphabet may produce the same letter for different input sequences in the original message, many reported matches in the encoded sequence may not correspond to actual matches in the original messages. Optionally, at 455, the determined LCS solution may be filtered to remove such false positives, particularly if certain letters are found to occur more frequently than others. For example, the Baum-Welch or similar algorithm may be used to generate a hidden Markov model (HMM), and then the Viterbi or similar algorithm may be applied to the LCS solution using this HMM to eliminate many of these false positives.

After determining the LCS within the encoded messages of episodes 2 and 3, the process 410-455 is repeated, using the encoded messages of episode 1 and the determined LCS, as indicated at 460 of FIG. 4. If other techniques are used to identify similar encoded messages in episodes 2 and 3, these techniques would be applied to determine whether these similar encoded messages also appear in episode 1. For example, if an accumulated count of coincidences within islands of a given minimum size is used to identify similar encoded messages in episodes 2 and 3, the encoded messages in episode 1 will be compared to one or both of these encoded messages to determine whether any of the messages in episode 1 also contains a sufficiently high accumulated count.

At 470, the original messages in one or more of the episodes corresponding to the LCS, or corresponding to an otherwise matched set of encoded messages, are identified. As at 455, the determined LCS of the combination of the three episodes may optionally be filtered to reduce false positives, at 465.

The process 410-470 is repeated for each encoded message in the episodes, so that each encoded message in episode 2 is compared to each encoded message in episode 3, then each encoded message in episode 1 is compared to the LCS or the set of messages that appear to be similar in episodes 2 and 3.

As noted above, the list of messages that appear to be repeated in each of the three episodes may be provided to any number of performance analysis systems. Application and network timing characteristics may be determined by assessing the trace records corresponding to these messages. In like manner, the determined LCS, or the determined set of encoded messages that appear to be similar in all three episodes, may be used to identify messages in subsequent execution episodes of the application.

FIG. 6 illustrates an example block diagram of a transaction analysis system that is suitable for correlating messages in a multi-tier network environment 601 in accordance with this invention. A single control element 690 is illustrated as providing control over the other elements in the system, although distributed control, including manual control, may also be used.

One or more traffic capture devices 610, typically termed “sniffers”, are configured to capture the traffic between select tier-pairs, and to store some or all of the captured traffic as “traces” 615. These traces 615 may be the result of a continuous monitoring of the traffic on the select tier-pairs, or a collection of discrete traces taken during different time intervals.

A traffic selector 620 is configured to select particular messages 625 between tier-pairs from among the traces 615. In accordance with an aspect of this invention, the selected messages should correspond to messages that occur between tier-pairs during different executions of a ‘target’ application transaction, the tier-pairs corresponding to tier-pairs that are likely to communicate messages as a result of the execution of the target application transaction.

The traffic selector 620 may also be configured to filter the messages between a source and destination of a tier-pair based on events that occur during the execution of the application. For example, as detailed above, messages that could not be related to the application because they occur before a first message of the transaction or after a last message of the transaction are not selected for subsequent processing. In like manner, because the subsequent processes are based on the content of the monitored messages, the traffic selector 620 may be configured to eliminate commonly occurring messages, such as acknowledgement messages, or messages that are likely to be too short to provide a meaningful comparison result.
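A minimal sketch of such filtering is given below, assuming each captured message is available as a timestamp and a payload. The Message type, the function name select_messages, and the default minimum payload length are hypothetical; in this sketch, acknowledgement messages are assumed to carry little or no payload and are therefore removed by the length test.

```python
from typing import NamedTuple


class Message(NamedTuple):
    timestamp: float  # capture time, in seconds
    payload: bytes    # message content used for the later encoding


def select_messages(trace, start, end, min_length=10):
    """Keep messages within the transaction's time span whose payload is
    long enough to provide a meaningful comparison."""
    return [m for m in trace
            if start <= m.timestamp <= end and len(m.payload) >= min_length]


trace = [
    Message(0.5, b"x" * 20),           # before the first transaction message
    Message(1.5, b"GET /index.html"),  # inside the span, long enough
    Message(1.7, b""),                 # empty payload, e.g. a bare acknowledgement
    Message(3.5, b"y" * 20),           # after the last transaction message
]
selected = select_messages(trace, start=1.0, end=3.0)  # keeps only the second message
```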

A finite-alphabet encoder 630 is configured to encode the selected messages between tier-pairs using letters of a finite-alphabet set. Preferably, this encoding results in encoded messages 635 that are substantially shorter than the actual messages between the tier-pairs. Typically, the encoder 630 includes a hash function having an output range that corresponds to the size of the finite-alphabet set.

A message comparer 640 is configured to compare the encoded messages in one episode to the encoded messages in another episode to identify encoded messages 645 that appear to be similar. Because the encoded messages are substantially smaller than the actual messages between tier-pairs, the time to perform this comparison is substantially smaller. Additionally, because the encoding is not unique, and different input sequences may produce the same encoded letter, the comparer 640 is preferably configured to attempt to match sequences (k-tuples) of encoded letters, rather than individual letters. This further improves efficiency by reducing the likelihood of identifying spurious matches that are merely the result of this many-to-one encoding process.

The comparer 640 identifies coincidences of the same k-tuple appearing in the messages of each of the episodes, and processes these coincidences to determine whether an encoded message in one episode appears to be similar to an encoded message in another episode. In an example embodiment of this invention, as detailed above, similar encoded messages are identified by determining a longest common sequence (LCS) occurring in the two messages, and then the messages of another episode are compared to the determined LCS to determine whether this other episode also contains an encoded message corresponding to this determined LCS. One of skill in the art will recognize that any of a variety of techniques are commonly available for comparing sequences, including those developed for comparing DNA and other sequences.

Based on the determination that certain encoded messages appear to be common among the episodes, the actual messages between the tier-pairs corresponding to these common encoded messages are identified 655, and provided to other tools that are configured to assess communications associated with the target application.

The foregoing merely illustrates the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are thus within the spirit and scope of the following claims.

In interpreting these claims, it should be understood that:

a) the word “comprising” does not exclude the presence of other elements or acts than those listed in a given claim;

b) the word “a” or “an” preceding an element does not exclude the presence of a plurality of such elements;

c) any reference signs in the claims do not limit their scope;

d) several “means” may be represented by the same item or hardware or software implemented structure or function;

e) each of the disclosed elements may be comprised of hardware portions (e.g., including discrete and integrated electronic circuitry), software portions (e.g., computer programming), and any feasible combination thereof;

f) hardware portions may include a processor, and software portions may be stored on a non-transitory computer-readable medium, and may be configured to cause the processor to perform some or all of the functions of one or more of the disclosed elements;

g) hardware portions may be comprised of one or both of analog and digital portions;

h) any of the disclosed devices or portions thereof may be combined together or separated into further portions unless specifically stated otherwise;

i) no specific sequence of acts is intended to be required unless specifically indicated; and

j) the term “plurality of” an element includes two or more of the claimed element, and does not imply any particular range of number of elements; that is, a plurality of elements can be as few as two elements, and can include an immeasurable number of elements.

We claim:
1. A method comprising: capturing a plurality of network traces, each network trace corresponding to original messages communicated between two nodes of a tier-pair during an execution episode of an application, encoding, by a transaction analysis system, content of some or all of the original messages in each network trace into letters of a finite-alphabet set to form corresponding encoded messages, such that a single letter of the finite-alphabet in each encoded message corresponds to a plurality of bytes in the original message, comparing, by the transaction analysis system, the encoded messages of a first episode to the encoded messages of a second episode to identify encoded messages that are similar to each other, and identifying, by the transaction analysis system, original messages in at least one of the plurality of network traces corresponding to the encoded messages that are identified as being similar to each other.
2. The method of claim 1, including filtering the network traces to identify the original messages of the tier-pair based on messages communicated between nodes of an other tier-pair.
3. The method of claim 2, including grouping the messages of the other tier-pair into activity intervals, and filtering the network traces based on parameters associated with these activity intervals.
4. The method of claim 2, including grouping the original messages of the tier-pair into first activity intervals, grouping the messages of the other tier-pair into second activity intervals, and filtering the network traces based on a correspondence in time between the first and second activity intervals.
5. The method of claim 4, including scoring the first activity intervals based on the correspondences in time and filtering the network traces based on the scoring.
6. The method of claim 5, wherein the scoring is based on: an overlap in time between the first and second activity intervals, a correspondence in time between a start of the first activity interval and a start of the second activity interval, and a correspondence in time between an end of the first activity interval and an end of the second activity interval.
7. The method of claim 1, wherein the encoding includes a hashing of the plurality of bytes of the original message.
8. The method of claim 1, including forming one or more of the plurality of bytes of the original message based on break points associated with the original message.
9. The method of claim 8, wherein the break points are based on a structure of the original message.
10. The method of claim 8, wherein the break points are based on content of the original message.
11. The method of claim 1, wherein comparing the encoded messages of the first and second episodes includes forming k-tuples of letters of the first and second encoded messages and comparing the k-tuples of the first and second encoded messages.
12. The method of claim 11, including comparing the first and second encoded messages based on k-tuples of a first size, then comparing at least parts of the first and second encoded messages based on k-tuples of a second size that is smaller than the first size.
13. The method of claim 11, wherein comparing the first and second encoded messages includes creating a matrix of coincidences between the first and second encoded messages and assessing coincidences of k-tuples along diagonals of the matrix.
14. The method of claim 13, wherein assessing the coincidences includes accumulating a count of sequential coincidences along the diagonals.
15. The method of claim 1, wherein comparing the encoded messages of the first and second episodes includes determining a longest common sequence of coincidences of letters in the encoded messages.
16. The method of claim 1, including comparing encoded messages of a third episode of the application to the encoded messages of the first and second episodes that are identified as being similar to identify encoded messages that are similar in the first, second, and third episodes.
17. A method comprising: identifying, at a performance analysis system, a hierarchy of tier-pairs, such that messages at a higher level of the hierarchy have a causal relationship to one or more messages at a lower level of the hierarchy, capturing traces of messages communicated within the tier-pairs, identifying, by the performance analysis system, activity intervals at each tier-pair corresponding to sequences of messages at each tier-pair, assessing, by the performance analysis system, the activity intervals at each lower level tier-pair based on parameters associated with activity intervals at a higher level tier-pair to identify activity intervals at the lower level tier pair that are potentially related to activity intervals at the higher level tier pair, and comparing, by the performance analysis system, the messages of activity intervals at the lower level tier-pairs that are potentially related to activity intervals at the higher level tier pair to identify messages at the lower level tier-pairs corresponding to one or more activity intervals at a highest level tier-pair.
18. The method of claim 17, wherein comparing the messages includes: encoding content of some or all of the messages into letters of a finite-alphabet set to form corresponding encoded messages, such that a single letter of the finite-alphabet in each encoded message corresponds to a plurality of bytes in the original message, and comparing the corresponding encoded messages.
19. The method of claim 17, wherein the messages being compared at each tier-pair correspond to messages captured during repeated executions of an application.
20. A non-transitory computer-readable medium that includes software that, when executed by a processor causes the processor to: receive a plurality of network traces, each network trace corresponding to original messages communicated between two nodes of a tier-pair during an execution episode of an application, encode content of some or all of the original messages in each network trace into letters of a finite-alphabet set to form corresponding encoded messages, such that a single letter of the finite-alphabet in each encoded message corresponds to a plurality of bytes in the original message, compare the encoded messages of a first episode to the encoded messages of a second episode to identify encoded messages that are similar to each other, and identify original messages in at least one of the plurality of network traces corresponding to the encoded messages that are identified as being similar to each other.